ABSTRACT

Objective: To conduct a fully independent and external 
validation of a research study based on one electronic 
health record database, using a different electronic 
database sampling the same population. 
Design: Using the Clinical Practice Research Datalink 
(CPRD), we replicated a published investigation into the 
effects of statins in patients with ischaemic heart disease 
(IHD) by a different research team using QResearch. We 
replicated the original methods and analysed all-cause 
mortality using: (1) a cohort analysis and (2) a case-control
analysis nested within the full cohort.
Setting: Electronic health record databases containing 
longitudinal patient consultation data from large numbers 
of general practices distributed throughout the UK. 
Participants: CPRD data for 34 925 patients with IHD 
from 224 general practices, compared to previously 
published results from QResearch for 13 029 patients 
from 89 general practices. The study period was from 
January 1996 to December 2003. 
Results: We successfully replicated the methods of the 
original study very closely. In a cohort analysis, risk of 
death was lower by 55% for patients on statins, 
compared with 53% for QResearch (adjusted HR 0.45, 
95% CI 0.40 to 0.50; vs 0.47, 95% CI 0.41 to 0.53). In 
case-control analyses, patients on statins had a 31% 
lower odds of death, compared with 39% for QResearch 
(adjusted OR 0.69, 95% CI 0.63 to 0.75; vs OR 0.61, 
95% CI 0.52 to 0.72). Results were also close for 
individual statins. 

Conclusions: Database differences in population 
characteristics and in data definitions, recording, quality 
and completeness had a minimal impact on key statistical 
outputs. The results uphold the validity of research using 
CPRD and QResearch by providing independent evidence 
that both datasets produce very similar estimates of 
treatment effect, leading to the same clinical and policy 
decisions. Together with other non-independent 
replication studies, there is a nascent body of evidence 
for wider validity. 



Strengths and limitations of this study 



Previous comparisons of electronic health record
(EHR) databases have compared different patient
populations or have not been done by independent
researchers. This is the first fully independent
validation of a published EHR-based study
using a different EHR database sampling from
the same underlying population.

Estimates obtained from Clinical Practice
Research Datalink (CPRD) for the treatment
effects of statins on mortality in patients with
ischaemic heart disease (IHD) were remarkably
similar to those from QResearch, providing a
degree of reassurance for clinicians, researchers
and policy-makers that findings using either
database would be essentially the same.

There were some demographic and other differences
between the CPRD and QResearch IHD
cohorts. Sensitivity analysis indicated that these
had only a minimal effect on the results.

We were able to successfully replicate nearly all
the elements of the original QResearch study
using CPRD, but this would not have been possible
without some input on methodological
detail from the authors of the original study.

The results add to evidence for the wider validity
of the UK primary care databases, but cannot be
generalised to EHRs in other countries where the
data quality may be quite different.



INTRODUCTION 

In recent years, electronic patient health 
records have emerged as an important new 
tool for medical research. Numbers of 
research publications based on analyses of 
electronic record databases have increased 
rapidly, and government-sponsored research
networks, such as the Observational Medical
Outcomes Partnership (OMOP) in the US and the
Canadian Network for Observational Drug Effect
Studies (CNODES), have been established to advance
research based on electronic records.

Globally, some of the largest and most detailed 
sources of electronic patient data are the UK 'primary 
care databases' (PCDs), some of which contain detailed 
data on all primary care consultations for millions of 
patients and span more than two decades. The three 
major UK PCDs are the CPRD (Clinical Practice
Research Datalink; formerly the General Practice
Research Database (GPRD)), QResearch and The
Health Improvement Network (THIN). These PCDs,
and CPRD in particular, are used as a resource by 
researchers throughout the world. A search on PubMed 
reveals that the combined number of articles based on 
data from these three PCDs published during 2000 was 
41; during 2012, it was 172. 

Despite the growing use of electronic datasets as 
research tools, there remain concerns about the validity 
of studies based on such data, including uncertainties 
around data quality, data completeness and the potential 
for bias from measured and unobserved confounders. 
The validity of coding in the UK PCDs has received a
considerable amount of attention in the literature, in
particular the completeness of recording of consultations
and disease diagnoses. The former has been found
to be generally high, whereas the coding of diagnoses
varies considerably by condition. The validity of data on
risk factors, such as blood pressure and cholesterol,
has received less attention, but is of particular
importance for PCD-based effectiveness studies, where
selection bias and non-random missing data have the
potential to produce misleading conclusions.

An alternative approach to assessing the validity of 
PCD-based studies is to compare the results with those 
obtained from equivalent investigations conducted on 
other independent datasets. This approach subsumes 
concerns about the validity and completeness of the 
underlying individual data values within the broader 
question of whether such flaws in the data make any dif- 
ference to the ultimate conclusions drawn from the ana- 
lysis. A number of studies have compared results 
obtained from PCDs with those from pre-existing randomised
controlled trials (RCTs) of the same intervention.
However, the use of RCTs as a gold standard
for judging the validity of PCD-based studies is questionable:
results may differ due to the very different contexts
of RCTs and observational data; for example, drug 
regimes, patient characteristics and use of additional or 
combination medications may differ radically in the real- 
world setting. Another approach to validating a 
PCD-based study is to replicate the study in one or more 
completely independent PCDs. Agreement of results, 
although not proving validity, would imply that the con- 
clusions do not depend on the data source. This is the 
approach we take in the present paper. Several studies
have applied the same design protocol to more than
one database. In most cases, these databases have
covered different countries, different geographical
areas of the same country or different patient populations
within a country, and also different kinds of
databases (eg, administrative claims data vs electronic
health records). Some studies have reported consistent
findings across databases; others have found varying 
results. In a study examining the heterogeneity of effect 
estimates for 53 drug-outcome pairs across 10 US data- 
bases (either claims data or electronic health records) 
using two different study designs, only nine pairs were 
consistent across databases in direction and statistical sig- 
nificance, with up to 19 pairs having effect estimates 
ranging from significantly increased risk to significantly 
decreased risk. 

With these studies, it is not possible to determine 
whether the heterogeneity of results is due to differ- 
ences in data recording and quality between databases, 
or to differences in demographics and health between 
the covered populations. To address this, comparisons 
are required involving databases that sample from the 
same underlying patient population. A few studies fall 
into this category, all based on UK PCDs. Bremner et al
examined early-life exposure to antibacterials and subse- 
quent development of hay fever by running identical 
analyses on the CPRD and the Doctors' Independent 
Network (DIN-LINK). The two child cohorts proved 
similar in all essentials and results of case-control analysis
were similar for both, even extending to a significant
association between antibacterial exposure and
development of hay fever that disappeared after adjustment
for number of consultations. Vinogradova et al examined
the relationship between exposure to bisphosphonates
and gastrointestinal cancers using QResearch and
CPRD data. They reported the two patient samples to be
similar in demographics, risk factors, comorbidities and
use of medications, and found no significant associations
between bisphosphonate use and various types of
cancers in either database.

Both of these studies, although informative, were
conducted by research groups instrumental in
the development of one of the compared PCDs (DIN-LINK
and QResearch, respectively) and so lacked independence.
The only fully independent studies are a series
of external validation studies using the THIN
database and risk prediction tools originally developed
using QResearch (eg, QRISK, QRISK2, QKidney,
QStatin). These studies applied the risk algorithms
previously derived using QResearch on patients
in THIN, and reported mostly good discriminative and 
calibration properties. However, these studies did not 
address the question of whether analysis of the two data- 
bases would result in the same at-risk algorithm itself. In 
this paper, we report what we believe to be the first com- 
pletely independent full replication of a published 
research study based on one PCD, using a different PCD 
that samples from the same population. Our overall aim 
was to assess the extent to which the model parameters 
(in this case, treatment effects) derived using one PCD
would be identical to those derived using the other 
PCD. We also examine a different clinical topic and 
outcome from these previous replications. 

METHODS 

CPRD and THIN obtain their data from practices using 
the Vision electronic record system, while QResearch 
obtains data from practices using EMIS software. We felt 
that comparisons would be most informative between 
databases drawing data from different capture systems. 
Across the time-period studied, two versions of EMIS 
were in use, the more common being the text-based
EMIS LV system with navigation and data entry mainly 
via the keyboard; EMIS PCS, which is Windows-based 
with mouse control and drop-down menus, was intro- 
duced from 1999. Vision was Windows-based throughout 
the study period. A small-scale direct comparison of 
EMIS LV and Vision indicated that coded data entry, 
excepting prescribing information, was faster with Vision 
and that more items were likely to be coded. Practices 
running Vision have slightly higher achievement rates 
for most Quality and Outcomes Framework (QOF) indi- 
cators than practices running either version of EMIS, 
even after controlling for differences in practice and 
area characteristics. We had access to CPRD, and therefore
chose to replicate a study previously conducted
using QResearch. CPRD and QResearch both draw data
from general practices spread throughout the UK (currently
more than 600 practices each) and comparisons
to the national age-gender structure and prevalence
rates for common conditions mostly show good correspondence
for both datasets. For practical reasons,
we focused on studies of the effectiveness of medicinal 
interventions and, after assessing the available studies, 
chose to replicate an investigation into the effects of 
statins on the mortality of patients with ischaemic heart 
disease (IHD) by Hippisley-Cox and Coupland 
(H-C&C). The methodological details provided in the
published paper were insufficient on their own to allow 
a close replication to be conducted, and we therefore 
obtained additional details from the authors. We 
requested purely factual information about the methods 
used and did not share any of our analyses or results. 

We replicated the methods of H-C&C as closely as pos- 
sible, given the differences between the two databases. 
All of the methods described below, including the study 
period, variable specifications and analytical procedures, 
are exact replications of those used in the original study, 
unless indicated otherwise. We selected all practices in 
CPRD that provided up-to-standard (UTS) data (UTS is
CPRD's designation for data meeting their internal 
quality standards) for the whole of the period from 1 
January 1996 to 17 December 2003. We next identified 
all patients with a first diagnosis of IHD within this 
period, based on the QOF business rules for 2004. We
excluded patients whose IHD diagnosis fell within the
first 3 months of registration with their general practice
or was on or subsequent to their recorded date of death,
or who were prescribed statins prior to first diagnosis.

We extracted data for these patients from the date of 
IHD diagnosis up until 17 December 2003, or until the 
date of death or exit from the practice, or the last 
recorded date for practices that stopped providing data 
before 17 December 2003, giving a maximum possible 
length of follow-up postdiagnosis of just under 8 years. 
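
The exclusion rules and follow-up window described above can be sketched in code. The following is a minimal illustration only, not the extraction code used in either study: the `Patient` record and its field names are hypothetical, and "3 months" is approximated as 90 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

STUDY_START = date(1996, 1, 1)
STUDY_END = date(2003, 12, 17)


@dataclass
class Patient:
    """Hypothetical record layout; field names are illustrative only."""
    registered: date
    ihd_diagnosis: date
    first_statin: Optional[date] = None
    death: Optional[date] = None
    left_practice: Optional[date] = None
    practice_last_data: Optional[date] = None


def is_eligible(p: Patient) -> bool:
    """Apply the exclusions described in the text."""
    if not (STUDY_START <= p.ihd_diagnosis <= STUDY_END):
        return False
    # Diagnosis within the first 3 months of registration
    # (assumption: 3 months approximated as 90 days).
    if p.ihd_diagnosis < p.registered + timedelta(days=90):
        return False
    # Diagnosis on or after the recorded date of death.
    if p.death is not None and p.ihd_diagnosis >= p.death:
        return False
    # Statins prescribed before the first IHD diagnosis.
    if p.first_statin is not None and p.first_statin < p.ihd_diagnosis:
        return False
    return True


def follow_up_end(p: Patient) -> date:
    """Earliest of study end, death, exit from practice, last practice data."""
    candidates = [STUDY_END, p.death, p.left_practice, p.practice_last_data]
    return min(d for d in candidates if d is not None)


# Maximum possible follow-up, in years (just under 8).
max_years = (STUDY_END - STUDY_START).days / 365.25
```

A patient diagnosed on the first day of the study period and still registered on 17 December 2003 contributes the maximum follow-up.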

Analysis 

The main outcome was all-cause mortality, identified 
through a record of death in the CPRD. Following 
H-C&C, we conducted two main analyses: (1) a cohort 
analysis and (2) a case-control analysis nested within the 
full cohort. All analyses were conducted using R.
Following H-C&C, statistical significance was assessed 
using p<0.01 (two tailed), but 95% CIs are reported in 
tables and figures. 

We made an a priori decision not to attempt to 
'improve' on the analysis conducted by H-C&C, as our 
specific aim was to determine whether the same results 
and conclusions would emerge from using identical 
methods on a different underlying dataset targeting the 
same population. 

Cohort analysis 

The cohort analysis used a Cox proportional hazards 
model to examine the effect of statin use on patient sur- 
vival, with survival time determined by the time (in days) 
between the date of first diagnosis and date of death. 
Patients who transferred out of their practice before death 
or who were still alive at the end of the study period were 
treated as censored observations. Statin exposure was used 
as a time-varying covariate, with the period of exposure 
from the date of first prescription to when the statin was 
stopped (estimated as the date of last prescription plus 
90 days; intervening breaks in the use of statins were 
ignored), or if not stopped until the end of the study 
period, date of death or date of transfer out of practice. 
Covariates adjusted for in the analysis were year of diagnosis,
gender, comorbidities (diabetes, hypertension, myocardial
infarction, congestive cardiac failure and cancer),
age (coded as 0-44, 45-54, 55-64, 65-74, 75-84, 85-94 or
≥95 years), smoking (ever smoked, never smoked, not
recorded) and body mass index (BMI; coded as <25, 25-30,
>30 kg/m²), all at the date of diagnosis. The presence
of each comorbidity was indicated by a diagnosis in the 
patient record (using the 2004 QOF business rules) and 
coded as present/not present at the date of IHD diagnosis. 
If smoking status or BMI was not recorded within 4 years 
prior to diagnosis of IHD, we coded it as missing. 
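
One common way to represent a time-varying exposure of this kind is to split each patient's follow-up into consecutive intervals flagged as on or off statins (the "counting process" layout accepted by most Cox regression software). The sketch below is an illustration under that assumption; the function name and argument names are hypothetical and it is not the code used in either study.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

Interval = Tuple[date, date, int]  # (start, stop, on_statin)


def exposure_intervals(
    diagnosis: date,
    follow_up_end: date,
    first_rx: Optional[date] = None,
    last_rx: Optional[date] = None,
    grace_days: int = 90,
) -> List[Interval]:
    """Split follow-up into unexposed / exposed / unexposed intervals.

    Exposure runs from the first prescription to the last prescription
    plus 90 days; breaks between prescriptions are ignored, as in the text.
    """
    if first_rx is None or first_rx >= follow_up_end:
        return [(diagnosis, follow_up_end, 0)]
    last = last_rx if last_rx is not None else first_rx
    start = max(first_rx, diagnosis)
    stop = min(last + timedelta(days=grace_days), follow_up_end)
    intervals: List[Interval] = []
    if start > diagnosis:
        intervals.append((diagnosis, start, 0))   # before first prescription
    intervals.append((start, stop, 1))            # on statins
    if stop < follow_up_end:
        intervals.append((stop, follow_up_end, 0))  # after stopping
    return intervals
```

Each interval would then enter the model as a separate row for the same patient, with the event indicator set only on the row in which death occurs.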

The analysis was undertaken using the R survival ana- 
lysis package accounting for the clustering of patients by 
practice and using the Huber-White robust estimate of 
SE. The proportional hazards assumption was checked 
graphically and with a test for proportional hazards. 
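
The categorical coding of age, BMI and smoking described in this section can be illustrated as follows. This is a sketch only: the category labels, the treatment of the 25 and 30 kg/m² boundaries, and the function names are assumptions, since the text does not specify them.

```python
from typing import Optional


def age_band(age: int) -> str:
    """Ten-year age bands used in the cohort analysis."""
    if age < 45:
        return "0-44"
    if age >= 95:
        return "≥95"
    lower = 45 + 10 * ((age - 45) // 10)
    return f"{lower}-{lower + 9}"


def bmi_category(bmi: Optional[float], years_since_recorded: Optional[float]) -> str:
    """BMI group; values older than 4 years before diagnosis count as missing."""
    if bmi is None or years_since_recorded is None or years_since_recorded > 4:
        return "not recorded"
    if bmi < 25:
        return "<25"
    if bmi <= 30:  # assumption: both boundaries fall in the middle group
        return "25-30"
    return ">30"


def smoking_category(ever_smoked: Optional[bool], years_since_recorded: Optional[float]) -> str:
    """Smoking status with the same 4-year look-back rule."""
    if ever_smoked is None or years_since_recorded is None or years_since_recorded > 4:
        return "not recorded"
    return "ever smoked" if ever_smoked else "never smoked"
```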
Nested case-control study

The nested case-control analysis compared all patients 
from the cohort who died during the follow-up period 
(the cases) with a group of matched control patients 
(also with IHD) who did not die. For each case, we 
defined an 'index date' as the date of death. We then 
used an incidence density sampling procedure (as per 
the original study; personal correspondence) to ran- 
domly select four control patients for each case matched 
on gender, year of IHD diagnosis and age (coded in 
5-year age-bands). General practice was not used as a 
matching variable. Controls were patients with IHD alive 
at the time their matched case died (including patients 
who themselves became cases at a later time-point). The
incidence density sampling procedure allowed the same patient
to be selected as a control for more than one case, thus
providing a full set of four controls for each case, while
still producing unbiased estimates of risk.
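
The sampling scheme described above can be sketched as follows. This is a simplified illustration, not the procedure used in either study: the `Member` record is hypothetical, age is approximated from year of birth on the index date, and cases with fewer than four eligible controls are simply left unmatched.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Member:
    """Hypothetical cohort record; field names are illustrative only."""
    pid: int
    gender: str
    birth_year: int
    diagnosis: date
    exit: date   # date of death, transfer out or end of study
    died: bool


def five_year_band(member: Member, on: date) -> int:
    # 5-year band of approximate age on the index date (assumption).
    return (on.year - member.birth_year) // 5


def sample_controls(cohort: List[Member], k: int = 4, seed: int = 0) -> Dict[int, List[int]]:
    """Incidence density sampling: k controls per case from its risk set."""
    rng = random.Random(seed)
    matched: Dict[int, List[int]] = {}
    for case in cohort:
        if not case.died:
            continue
        index = case.exit  # index date = date of death
        risk_set = [
            m for m in cohort
            if m.pid != case.pid
            and m.gender == case.gender
            and m.diagnosis.year == case.diagnosis.year
            and five_year_band(m, index) == five_year_band(case, index)
            and m.diagnosis <= index   # already diagnosed with IHD
            and m.exit > index         # alive and under follow-up (may die later)
        ]
        if len(risk_set) >= k:
            # Sampling is independent per case, so a patient can serve
            # as a control for more than one case.
            matched[case.pid] = [m.pid for m in rng.sample(risk_set, k)]
    return matched
```

Because the risk set is rebuilt at each index date, a patient who dies later remains eligible as a control for earlier cases.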

Statin exposure was based on the first and last pre- 
scription dates prior to the index date and coded into: 
(1) currently taking statins (last prescription was within 
90 days of the index date); (2) previously took statins 
(last prescription more than 90 days prior to the index 
date) and (3) has never taken statins. We did this for all 
statins as a group and also separately for five different 
types of statin (atorvastatin, cerivastatin, fluvastatin, pra- 
vastatin and simvastatin). For 'all statins', the last pre- 
scription could be for a different statin type than the 
first; for individual statins, it had to be the same type. 
One further formulation, rosuvastatin, was in use that 
did not appear in the QResearch study. We included this 
in the 'all statins' group but did not analyse it individu- 
ally as only 22 patients had received the statin. 
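
The three-level exposure coding for the case-control analysis can be expressed compactly. The sketch below is illustrative only; the function name and labels are hypothetical, and for an individual statin the caller would pass only the prescription dates for that statin.

```python
from datetime import date
from typing import Iterable


def statin_status(index_date: date, prescription_dates: Iterable[date], window_days: int = 90) -> str:
    """Classify statin use relative to the index date.

    Only prescriptions issued before the index date are considered.
    """
    prior = [d for d in prescription_dates if d < index_date]
    if not prior:
        return "never"
    days_since_last = (index_date - max(prior)).days
    return "current" if days_since_last <= window_days else "previous"
```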

Analysis of the case-control study used conditional 
logistic regression accounting for the matching of cases 
with controls, to obtain ORs for the risk of death in rela- 
tion to use of statins. We allowed for clustering by 
general practice and used a robust estimate of SE, in 
line with the cohort analysis. Covariates in the analysis 
were smoking status, BMI and comorbidities, specified 
as in the Cohort analysis but based on the index date 
rather than the date of diagnosis. Additional covariates 
in this analysis were the Townsend deprivation score for 
the practice postcode (in national quintiles; H-C&C 
used quintiles of patient-level Townsend scores) and use 
of β-blockers, aspirin, ACE inhibitors and calcium
channel blockers, identified through the British
National Formulary chapter codes in the patient
record. Each medication was coded as either used or 
not used prior to the index date but after the date of 
IHD diagnosis. Interactions between use of statins and 
each of gender, age (less than 75 vs 75 and over) and 
diabetes were tested by adding interaction terms into 
the model. 

Sensitivity analysis 

To replicate the original sensitivity analyses, we reran the
case-control study: (1) while excluding patients without
recorded values for BMI or smoking; (2) with the
sample restricted to patients without a diagnosis of 
cancer; (3) with the sample restricted to patients 
without diabetes, congestive cardiac failure or myocar- 
dial infarction and (4) using only those cases who sur- 
vived for at least 1 year after diagnosis of IHD and their 
matched controls. The definitions of death and deprivation
differed between CPRD and QResearch; to assess sensitivity to
this, we repeated the cohort and case-control analyses
restricted to practices for which
patient-level Office for National Statistics (ONS) official
death dates and Townsend scores were available (58% of
practices and 60% of patients).

Our primary analysis replicated H-C&C in restricting 
the sample to only those practices with data for the full 
8-year period. However, inclusion criteria for 
CPRD-based studies are generally patient-based rather 
than practice-based, and include all individual patients 
with UTS data for the analysis period (ie, from diagnosis 
date to end of study, death or transfer out of practice), 
and on this basis 61 458 patients from 577 CPRD prac- 
tices could be included. We therefore repeated the main 
analyses using this sample. 

RESULTS 

Comparison of patient cohorts 

A higher number of practices contributed data to the 
CPRD cohort: 224 compared with 89 for QResearch, 
resulting in a total sample of patients with a first diagno- 
sis of IHD in the study period, after exclusions, of 34 925 
compared with 13 029 (table 1; note that if the original 
study were undertaken now, additional practices (with 
their retrospective data) added to QResearch since 2006 
would produce a more equivalently sized cohort). 
Incident cases per practice were considerably higher
for CPRD (on average, 242 compared with 190), possibly 
implying that the included CPRD practices were gener- 
ally larger, though a smaller proportion of CPRD 
patients met the study inclusion criteria (64.4% vs 77%). 
H-C&C provided descriptive statistics for only certain 
covariates, and reported these in person-years rather 
than counts. Total person-years of observations in CPRD 
were 125 709 compared with 43 460 in QResearch. 
QResearch included a greater proportion of person- 
years from older patients (36.3% from patients aged 75 
or over compared with 28.1%) and had a much higher 
representation of congestive cardiac failure (14% com- 
pared with 6.0%), but less hypertension (28.9% com- 
pared with 35.9%). These figures imply some 
demographic differences between the cohorts. 
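
The descriptive figures quoted above and in tables 1 and 2 follow directly from the reported counts. The short sketch below recomputes a few of them as a consistency check; the function names are illustrative, and CIs are not recomputed because the method used to derive them is not stated.

```python
def per_practice(n_patients: int, n_practices: int) -> float:
    """Average number of incident IHD cases per practice."""
    return n_patients / n_practices


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def rate_per_1000(deaths: int, person_years: int) -> float:
    """Crude mortality rate per 1000 person-years, one decimal place."""
    return round(1000 * deaths / person_years, 1)


# Counts as reported for the two cohorts.
cprd_cases_per_practice = round(per_practice(54217, 224))   # 242
qres_cases_per_practice = round(per_practice(16920, 89))    # 190
cprd_eligible_pct = percent(34925, 54217)                   # 64.4
qres_eligible_pct = percent(13029, 16920)                   # 77.0
cprd_mortality = rate_per_1000(6725, 125709)                # 53.5
qres_mortality = rate_per_1000(2266, 43460)                 # 52.1
```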

Table 2 reports mortality rates from the two studies by 
various patient characteristics. Age-band specific mortal- 
ity rates were higher in CPRD for all age-bands except 
the youngest (0-44 years), although, owing to differing 
age distributions, the overall mortality rates were very 
similar (53.5/1000 person-years for CPRD compared 
with 52.1 for QResearch). Mortality was slightly higher 
Table 1 Descriptive statistics of the CPRD and QResearch samples

                                                      CPRD             QResearch
Number of general practices in the study              224              89
All patients with a first diagnosis of IHD during
  the study period (incidence rate per practice)      54 217 (242)     16 920 (190)
Number (%) of incident cases meeting the
  inclusion criteria                                  34 925 (64.4%)   13 029 (77.0%)
End of study status
  Died                                                6725 (19.3%)     2266 (17.4%)
  Alive                                               24 292 (69.6%)   9609 (73.8%)
  Left before end of study                            3908 (11.2%)     1154 (8.9%)
Total person years of observation                     125 709          43 460

                                  CPRD                     QResearch
                                  Person years    %        Person years    %
Age band (years)
  0-44                            4997            4.0      824             1.9
  45-54                           16 506          13.1     3923            9.0
  55-64                           30 431          24.2     9270            21.3
  65-74                           38 365          30.5     13 636          31.4
  75-84                           28 082          22.3     11 827          27.2
  85-94                           7073            5.6      3744            8.6
  ≥95                             255             0.2      [illegible]     0.5
Women                             57 169          45.5     18 539          42.7
Men                               68 540          54.5     [illegible]     57.3
No diabetes                       114 949         91.4     39 814          91.6
Diabetes                          10 760          8.6      3646            8.4
No hypertension                   80 574          64.1     30 912          71.1
Hypertension                      45 136          35.9     12 547          28.9
No congestive cardiac failure     118 209         94.0     37 391          86.0
Congestive cardiac failure        7501            6.0      6069            14.0

CPRD, Clinical Practice Research Datalink; IHD, ischaemic heart disease.

for women than for men in both datasets. In both
cohorts, diabetes and congestive cardiac failure were
associated with greatly increased death rates.

Survival analysis

The Kaplan-Meier survival curve, uncontrolled for any
covariates (figure 1), shows a clear survival advantage



Table 2 Comparison of mortality rates in the Clinical Practice Research Datalink (CPRD) and QResearch cohorts: eligible
patients only

                   CPRD                                                     QResearch
                   Person     Number of   Rate/1000                         Rate/1000
Cohort             years      deaths      person-years   95% CI             person-years   95% CI
All                125 709    6725        53.5           52.3 to 54.8       52.1           50 to 54.3
Age band (years)
  0-44             4997       44          8.8            6.5 to 11.9        9.7            4.9 to 19.4
  45-54            16 506     202         12.2           10.6 to 14.1       10.2           7.5 to 13.9
  55-64            30 431     638         21.0           19.4 to 22.7       16.8           14.4 to 19.7
  65-74            38 365     1639        42.7           40.7 to 44.8       32.8           29.9 to 36
  75-84            28 082     2638        93.9           90.6 to 97.4       77.0           72.2 to 82.2
  85-94            7073       1457        206.0          196.6 to 215.6     167.2          154.6 to 180.8
  ≥95              255        107         419.8          359.0 to 483.1     331.4          265.4 to 413.7
Women              57 169     3116        54.5           52.7 to 56.4       54.1           50.9 to 57.6
Men                68 540     3609        52.7           51.0 to 54.4       50.7           48 to 53.6
No diabetes        114 949    5835        50.8           49.5 to 52.1       49.7           47.5 to 51.9
Diabetes           10 760     890         82.7           77.6 to 88.1       79.0           70.4 to 88.7
No hypertension    80 574     4241        52.6           51.1 to 54.2       50.8           48.3 to 53.4
Hypertension       45 136     2484        55.0           53.0 to 57.2       55.5           51.5 to 59.8
No CCF             118 209    5410        45.8           44.6 to 47.0       41.4           39.3 to 43.5
CCF                7501       1315        175.3          166.8 to 184.2     118.6          110.3 to 127.6

CCF, congestive cardiac failure.
Figure 1 Kaplan-Meier plot showing survival of patients
taking statins compared with patients not taking statins,
uncontrolled for covariates; CPRD study. Horizontal axis:
time since diagnosis of IHD (years), 0 to 8.

for patients taking statins, with a raw HR of 0.25 (95% 
CI 0.23 to 0.27; table 3). At 6 years, the unadjusted sur- 
vival rate for patients taking statins was 89%, versus 63% 
for those not taking statins, remarkably similar to the 
reported values of 89% and 66% for the QResearch 
cohort. 

The Cox proportional hazards model (adjusted for 
the covariates gender, age, year of diagnosis, diabetes, 
hypertension, congestive cardiac failure, myocardial 
infarction, cancer, BMI and smoking) significantly 
departed from the proportional hazards assumption 
when year of diagnosis was specified as a continuous 
variable (p<0.01). Respecifying year of diagnosis as a 
stratification variable resolved this problem. The 
adjusted HR was 0.45 (95% CI 0.40 to 0.50), very close 
to the adjusted HR for QResearch patients of 0.47 (95% 
CI 0.41 to 0.53) and representing a 55% lower risk of 
death for patients on statins. 
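
The percentage reductions quoted alongside the HRs and ORs are simply one minus the ratio, expressed as a percentage. A minimal sketch of that arithmetic, using the adjusted estimates reported in the abstract (the function name is illustrative):

```python
def percent_reduction(ratio: float) -> int:
    """Relative reduction implied by a hazard or odds ratio below 1."""
    return round((1 - ratio) * 100)


# Adjusted HRs from the cohort analyses (reduction in hazard of death).
cprd_hr_reduction = percent_reduction(0.45)   # 55
qres_hr_reduction = percent_reduction(0.47)   # 53

# Adjusted ORs from the case-control analyses (reduction in odds of death).
cprd_or_reduction = percent_reduction(0.69)   # 31
qres_or_reduction = percent_reduction(0.61)   # 39
```

Note that the HR-based figure is a reduction in the hazard of death, whereas the OR-based figure is a reduction in the odds, so the two are not directly interchangeable.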

Case-control study 

The case-control analysis included 6683 cases and 26 732 
controls (for 42 cases, we were unable to find matching controls).
The cases and controls were well matched in terms of
median age, gender and duration of IHD (table 4). They 
were also very close in these respects to the patients in the 
QResearch study. Compared with QResearch, slightly lower 
percentages of cases and controls had received a prescrip- 
tion for statins (16.6% vs 19.6% and 22.8% vs 25.4%, 
respectively), though there were smaller differences in the 
percentages taking them for more than 1 year. For completeness,
online supplementary table A provides a comparison
of the CPRD cases and controls on the unmatched
covariates in the analysis (not available for QResearch). 
These show a good to acceptable balance on all covariates. 
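
The case and control counts and the statin percentages in table 4 can be cross-checked from the reported numbers. The following sketch does this for both databases; the helper name is illustrative and the inputs are the counts as printed.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# CPRD: 6725 deaths in the cohort, 42 without matching controls.
cprd_cases = 6725 - 42              # 6683
cprd_controls = 4 * cprd_cases      # 26 732, four controls per case

# Percentage ever prescribed statins (cases, controls).
cprd_statin_pct = (pct(1108, 6683), pct(6083, 26732))   # (16.6, 22.8)
qres_statin_pct = (pct(445, 2266), pct(2303, 9064))     # (19.6, 25.4)

# Of those prescribed statins, percentage taking them for >12 months.
cprd_long_pct = (pct(572, 1108), pct(3398, 6083))       # (51.6, 55.9)
qres_long_pct = (pct(228, 445), pct(1336, 2303))        # (51.2, 58.0)
```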



Table 3 Unadjusted and adjusted HRs (95% CI) of risk of
death for patients taking statins compared with patients not
taking statins; Clinical Practice Research Datalink (CPRD)
and QResearch studies

              CPRD                    QResearch
Unadjusted    0.25 (0.23 to 0.27)     Not reported
Adjusted      0.45 (0.40 to 0.50)     0.47 (0.41 to 0.53)


Patients who were currently on a statin had a signifi- 
cantly decreased rate of death compared with patients 
who had never taken a statin in unadjusted (OR=0.57, 
95% CI 0.53 to 0.62) and adjusted (OR=0.69, 95% CI 
0.63 to 0.75) analyses (table 5). These ORs were very 
close to those reported for QResearch (unadjusted 
OR=0.53, 95% CI 0.46 to 0.61; adjusted OR=0.61, 95% 
CI 0.52 to 0.72; figure 2). Patients who were previously, 
but not currently, on a statin had a significantly elevated 
risk of death in unadjusted and adjusted analysis, 
whereas H-C&C found a significantly increased risk in 
unadjusted analysis only. 

In the case of individual statins, the QResearch study 
reported significant protective effects against risk of 
death for current use of atorvastatin and simvastatin; 
with CPRD we found the same, but also a protective 
effect for pravastatin, which just failed to reach 1% sig- 
nificance in QResearch (p=0.013). For all five statins, 
our current use OR point estimates were similar to those 
from QResearch, though the larger CPRD sample pro- 
duced narrower CIs. Like the original study, we found 
no significant effects in patients who were previously, but 
not currently, on any individual statin, despite the 
increased risk for all statins combined. 



Effects of age, sex and diabetes on the effectiveness 
of statins 

Like H-C&C, we found no evidence for an interaction 
between gender and statin use (p=0.84), or diabetes and 
statin use (p=0.62). Unlike H-C&C, however, we did find 
a significant interaction with age (p<0.001), with an 
adjusted OR of 0.55 (95% CI 0.48 to 0.62) for people 
aged less than 75 and 0.77 (95% CI 0.65 to 0.92) for 
those aged 75 or over, indicating greater benefit for 
those under 75 years of age. 



Results for sensitivity analyses 

Repeating the sensitivity analyses from the original study, 
we found that restricting the case-control sample to 
patients with BMI and smoking status information, or to 
those without cancer, or without a diagnosis of diabetes, 
congestive cardiac failure or myocardial infarction made 
very little difference to our results. Restricting the 
sample to patients alive for at least 1 year after diagnosis 
of IHD likewise made little difference (see online
supplementary figure A).

Restricting the CPRD sample to practices for which 
patient-level Townsend scores and ONS official death 
dates were available made no appreciable difference to 
the results of either the cohort (HR 0.47, 95% CI 0.40 to 
0.54) or case-control analyses (see online supplementary 
figure B). Similarly, widening the sample to include all 
patients with UTS data made little difference (cohort
study HR 0.43, 95% CI 0.39 to 0.47; for the case-control
analysis, see online supplementary figure C).
Table 4 Comparison of cases and controls for CPRD
and QResearch studies

                                    CPRD            QResearch
Number of patients
  Cases                             6683            2266
  Controls                          26 732          9064
Median age in years
  Cases                             80              80
  Controls                          80              80
Male (%)
  Cases                             53.6            55.7
  Controls                          53.6            55.7
Median duration of IHD (months)
  Cases                             21.3            20.3
  Controls                          21.4            21.0
N (%) prescribed statins
  Cases                             1108 (16.6)     445 (19.6)
  Controls                          6083 (22.8)     2303 (25.4)
Of those prescribed statins, N (%)
taking them for >12 months
  Cases                             572 (51.6)      228 (51.2)
  Controls                          3398 (55.9)     1336 (58)

CPRD, Clinical Practice Research Datalink; IHD, ischaemic heart disease.

DISCUSSION 

We conducted an independent replication of a primary 
care database-based study using a different primary care 
database, sampling from the same population. We repli- 
cated the methods of the original study as closely as we 
could and reached exactly the same clinical conclusions 
concerning the effects of statins on mortality in all their 
essentials. Moreover, our point estimates for the key 
statistical parameters (the HR from the cohort analysis 
and the ORs from the nested case-control study) were 
remarkably similar to those reported by the original study. 
For the period under study, CPRD provided a much larger 
sample than was included in the original QResearch study. 
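
One informal way to quantify "remarkably similar" is to compare the two estimates on the log scale, recovering each standard error from its reported 95% CI. The sketch below applies this to the adjusted HRs and ORs quoted in the abstract. It is not a procedure reported by either study: it assumes the two samples are independent, that the intervals are symmetric on the log scale, and the helper names are our own.

```python
import math


def log_se_from_ci(lower, upper, z_crit=1.96):
    """Standard error of log(ratio) implied by a 95% CI (lower, upper)."""
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)


def compare_ratios(est_a, ci_a, est_b, ci_b):
    """z statistic and two-sided p value for the difference of two log ratios."""
    se_a = log_se_from_ci(*ci_a)
    se_b = log_se_from_ci(*ci_b)
    diff = math.log(est_a) - math.log(est_b)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return z, p


# Adjusted estimates from the abstract: CPRD first, QResearch second.
z_hr, p_hr = compare_ratios(0.45, (0.40, 0.50), 0.47, (0.41, 0.53))
z_or, p_or = compare_ratios(0.69, (0.63, 0.75), 0.61, (0.52, 0.72))

print(f"Cohort HR:       z = {z_hr:.2f}, p = {p_hr:.2f}")
print(f"Case-control OR: z = {z_or:.2f}, p = {p_or:.2f}")
```

Under these assumptions neither difference approaches conventional significance, which is consistent with the reading that the two databases give closely matching estimates.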



LIMITATIONS 

While we were able to exactly replicate nearly all the ele- 
ments of the original study, there were a few minor dif- 
ferences due mainly to data specifications. The datasets 
may have differed in the way in which all-cause mortality 
is defined, as both use their own bespoke algorithm. For 
area deprivation, QResearch used Townsend quintile 
deprivation scores at the patient-level, whereas these 
scores were fully available in CPRD only at the practice 
level, and for only 60% of the cohort at the patient-level. 
We tested for the impact of these factors by running a 
sensitivity analysis using the subset of CPRD patients for 
whom linked ONS data on the date of death and 
residential Townsend scores were available. 



FINDINGS 

We observed a number of differences between the 
CPRD and QResearch cohorts, in particular eligibility 
rates, age distributions and some comorbidities; hence, 
there may have been some population differences 
between the cohorts. Although both PCDs purport to 
provide nationally representative coverage of the UK 
population, only subsets of the practices in each were 
included (those providing data for the whole study 
period) and it may be that these differed in coverage: 
for the time-period of the study QResearch included 
more practices from the South and East of the UK 
than our CPRD cohort, whereas the latter included 
higher concentrations of practices from London, the 
North West and Scotland. It is also possible that recording 
differences between the Vision and EMIS computer 
software systems resulted in differential coding of some 
comorbidities. 

Despite these differences, all of our key results using 
CPRD were very close to those based on QResearch. 
The cohort analysis returned remarkably similar values 
to H-C&C for the statin use HR, both with and without 
control for covariates. Likewise, the nested case-control 
study yielded a very similar estimate for the protective 
effect of current statin use. Also, like H-C&C, we found 
separate protective effects for current users of atorvastatin 
and simvastatin. Another formulation, pravastatin, 
was found to be protective using CPRD but just failed to 
reach 1% significance in the QResearch study, the difference 
most likely being due to the larger CPRD sample. 

Table 5 Unadjusted and adjusted ORs comparing cases and controls by type and use of statins, CPRD study 

                   Unadjusted OR                    Adjusted OR 
                   (compared with                   (compared with 
                   never used)      95% CI          never used)      95% CI          p Value 
Used previously 
  Any statin       1.43             1.24 to 1.65    1.4              1.21 to 1.63    <0.001 
  Atorvastatin     1.15             0.92 to 1.45    1.18             0.94 to 0.50    0.158 
  Cerivastatin     0.59             0.43 to 0.83    0.67             0.48 to 0.94    0.02 
  Fluvastatin      0.77             0.50 to 1.19    0.89             0.56 to 1.40    0.611 
  Pravastatin      1.03             0.78 to 1.36    1.13             0.85 to 1.50    0.398 
  Simvastatin      1.02             0.87 to 1.19    1.07             0.91 to 1.26    0.385 
Used currently 
  Any statin       0.57             0.53 to 0.62    0.69             0.63 to 0.75    <0.001 
  Atorvastatin     0.55             0.48 to 0.64    0.66             0.57 to 0.77    <0.001 
  Cerivastatin     0.48             0.30 to 0.77    0.59             0.36 to 0.96    0.0336 
  Fluvastatin      0.50             0.33 to 0.76    0.65             0.43 to 0.99    0.0468 
  Pravastatin      0.60             0.49 to 0.73    0.68             0.56 to 0.84    <0.001 
  Simvastatin      0.64             0.58 to 0.71    0.78             0.70 to 0.87    <0.001 
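
When ORs and CIs such as those in Table 5 are transcribed between documents, a simple internal-consistency check can catch copying errors: the lower bound should be below the upper bound, the estimate should lie inside the interval, and, for a ratio, the estimate should sit near the geometric midpoint of the bounds. The sketch below applies these checks to the adjusted "used previously" rows as they appear in Table 5 here. The function name and the tolerance are our own assumptions, and a flagged row indicates only that the printed figures disagree with each other, not what the correct value is.

```python
import math


def check_interval(estimate, lower, upper, tol=0.05):
    """Return a list of inconsistencies for one reported ratio and its 95% CI."""
    problems = []
    if not lower < upper:
        problems.append("lower bound is not below upper bound")
    if not lower <= estimate <= upper:
        problems.append("estimate lies outside the interval")
    if 0 < lower < upper:
        # For a ratio with a log-symmetric CI, the estimate should be close
        # to the geometric mean of the two bounds (up to rounding).
        midpoint = math.sqrt(lower * upper)
        if abs(math.log(estimate / midpoint)) > tol:
            problems.append("estimate is far from the geometric midpoint")
    return problems


# Adjusted ORs for previous use, as printed in Table 5: (label, OR, lower, upper).
rows = [
    ("Any statin", 1.4, 1.21, 1.63),
    ("Atorvastatin", 1.18, 0.94, 0.50),
    ("Cerivastatin", 0.67, 0.48, 0.94),
    ("Fluvastatin", 0.89, 0.56, 1.40),
    ("Pravastatin", 1.13, 0.85, 1.50),
    ("Simvastatin", 1.07, 0.91, 1.26),
]

flagged = [label for label, est, lo, hi in rows if check_interval(est, lo, hi)]
print("Rows with inconsistent intervals:", flagged)
```

Applied to these rows, only the atorvastatin interval (0.94 to 0.50) is flagged, since its upper bound is below both its lower bound and the point estimate.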

In both studies, evidence for an elevated risk of death 
for patients who were previously, but not currently, on 
statins was somewhat at odds with the results for individual 
statins. However, cerivastatin was withdrawn from the 
market in 2001, in the middle of the study period, 
which may have had a complex impact on these results, 
particularly since cerivastatin users are likely to have 
switched to a different statin on removal. To examine 
the impact of discontinuation of cerivastatin, we 
repeated the case-control analysis using only the data 
prior to 1 January 2001. The resulting all-statins OR was 
no longer statistically significant and in greater accord 
with those for individual statins (see online 
supplementary figure D). 



Findings from meta-analyses of a large number of 
RCTs leave little doubt that statins do indeed benefit 
patients with IHD. However, the more than 50% 
reductions in mortality risk in CPRD and QResearch are 
very much greater than the reductions of 20-30% 
reported in major trials, or the overall reduction of 
17% from meta-analysis of 92 trials. One possibility is 
that these observational studies are biased by unmeas- 
ured confounding factors, but another is that RCTs 
might substantially underestimate the benefit of statins 
in the actual population of users. 
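
The percentage reductions compared in this paragraph follow from the ratio estimates as (1 - ratio) x 100. The short sketch below makes that arithmetic explicit, using the adjusted HRs and ORs from the abstract and the 17% meta-analytic figure quoted above. The function names are our own, and treating trial risk reductions and observational HRs or ORs on a single scale is a simplification made for illustration only.

```python
def percent_reduction(ratio):
    """Percentage reduction implied by a hazard, odds or risk ratio below 1."""
    return (1 - ratio) * 100


def ratio_from_reduction(percent):
    """Ratio implied by a stated percentage reduction."""
    return 1 - percent / 100


# Adjusted estimates quoted in the abstract.
estimates = {
    "CPRD cohort HR": 0.45,
    "QResearch cohort HR": 0.47,
    "CPRD case-control OR": 0.69,
    "QResearch case-control OR": 0.61,
}

reductions = {name: round(percent_reduction(r)) for name, r in estimates.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% lower")

# The 17% overall reduction from the meta-analysis, expressed as a ratio.
meta_ratio = ratio_from_reduction(17)
print(f"Meta-analysis ratio: {meta_ratio:.2f}")
```

Expressed as ratios, the contrast is between roughly 0.45 to 0.47 in the two cohort analyses and about 0.83 in the meta-analysis.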

However, our intention in conducting the research 
was more methodological than clinical: to establish 
whether analyses of different PCDs would lead to the 
same overall clinical conclusions. To this end, we kept all 
aspects of the analysis as constant as possible except for 
the PCD itself. The closeness of the results suggests that 
any variations between the datasets in population 
characteristics, data definitions, data quality and completeness 
had only a minimal impact on the key statistical 
outputs: the HRs and ORs that are the main 
parameters used to inform clinical and policy decision-making. 
The few differences in statistical significance 
were principally attributable to the considerably larger 
size of the CPRD sample. 

Our results also demonstrate that PCD-based studies 
can be successfully and independently replicated in 
other PCDs. However, this was only possible with the 
cooperation of H-C&C, as the original paper did not 
include the necessary methodological detail: for 
example, the Read and other codes used to define IHD 
and other morbidities; how drug exposures were mea- 
sured; the precise specification of each covariate; and 
the method used to select matched controls. Such 
absence of methodological detail is near ubiquitous 
throughout the field and at least partly attributable to 
journal restrictions on paper length. Most scientific jour- 
nals now allow supplementary material to be published 
online alongside the main paper, and we would encour- 
age researchers to publish their full methods online, 
whenever possible, and journal editors to encourage 
this. To facilitate this, we have set up an online code-list 
repository. 

Our results provide a degree of reassurance about the 
validity of PCD-based studies, at least in terms of 
research undertaken using CPRD and/or QResearch. 
Together with Vinogradova et al's replication of a different 
clinical topic, the findings suggest that these two 
PCDs produce estimates of treatment effects that are 
substantially the same. Combined with replication 
studies comparing CPRD and DIN-LINK, there is a 
nascent body of evidence for wider validity. We also note 
that whereas previous replications concerned null (non- 
significant) findings, the present study is evidence for 
successful replication of a positive intervention effect, 
which is arguably a stronger test of agreement. However, 
we emphasise that this paper has addressed validity only 
in the sense of consistency of statistical results, not the 
accuracy of the effect estimates relative to some 'true' 
value or the validity of the clinical conclusions drawn 
from these: analyses from both PCDs could conceivably 
be biased in the same direction, due to unmeasured 
factors common to both or limitations in the analysis 
methods themselves. 

Nevertheless, further replication studies similar to ours 
are needed. PCDs are used to address a wide range of dif- 
ferent kinds of research questions, using a great variety of 
designs and analytical methods, and replications of 
studies based around other forms of research design 
would be particularly informative. Our study used UK 
PCDs, which are generally acknowledged to be of higher 
quality and completeness than databases available for 
most countries, and we would urge researchers in other 
countries to undertake similar comparison studies. 